UCB and InfoGain Exploration via $\boldsymbol{Q}$-Ensembles

Authors

  • Richard Y. Chen
  • Szymon Sidor
  • Pieter Abbeel
  • John Schulman
Abstract

We show how an ensemble of Q-functions can be leveraged for more effective exploration in deep reinforcement learning. We build on well-established algorithms from the bandit setting and adapt them to the Q-learning setting. We propose an exploration strategy based on upper-confidence bounds (UCB). Our experiments show significant gains on the Atari benchmark.
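
To make the idea concrete, here is a minimal sketch of UCB-style action selection over a Q-ensemble. It assumes the upper-confidence score for each action is the ensemble mean plus a multiple of the ensemble standard deviation; the function name `ucb_action`, the coefficient `lam`, and the toy Q-values are illustrative assumptions, not taken from the paper.

```python
from statistics import fmean, pstdev


def ucb_action(q_values, lam=0.1):
    """Pick an action by an upper-confidence score over an ensemble.

    q_values: list of K lists, one per ensemble member, each holding that
              member's Q estimate for every action in the current state.
    lam:      exploration coefficient (hypothetical default, an assumption).
    """
    num_actions = len(q_values[0])
    scores = []
    for a in range(num_actions):
        # Collect every ensemble member's estimate for action a.
        estimates = [member[a] for member in q_values]
        # Assumed UCB score: mean estimate plus lam times the disagreement.
        scores.append(fmean(estimates) + lam * pstdev(estimates))
    # Return the index of the highest-scoring action.
    return max(range(num_actions), key=scores.__getitem__)


# Toy ensemble of 3 members over 3 actions (made-up numbers).
q_ensemble = [
    [1.0, 0.2, 0.5],
    [1.0, 1.8, 0.5],
    [1.0, 0.4, 0.5],
]

greedy = ucb_action(q_ensemble, lam=0.0)      # ignores disagreement
optimistic = ucb_action(q_ensemble, lam=1.0)  # rewards disagreement
```

With `lam=0.0` the rule reduces to acting greedily on the ensemble mean. A larger `lam` favors actions on which the ensemble members disagree, which is where additional data is most likely to be informative.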

Similar resources

Lower PAC bound on Upper Confidence Bound-based Q-learning with examples

Recently, there has been significant progress in understanding reinforcement learning in Markov decision processes (MDP). We focus on improving Q-learning and analyze its sample complexity. We investigate the performance of tabular Q-learning, approximate Q-learning and UCB-based Q-learning. We also derive a lower PAC bound Ω( |S| |A| 2 ln |A| δ ) of UCB-based Q-learning. Two tasks, Ca...

The Relationship Between High-Order Aberration and Anterior Ocular Biometry During Accommodation in Young Healthy Adults

Purpose: This study investigated the anterior ocular anatomic origin of high-order aberration (HOA) components using optical coherence tomography and a Shack-Hartmann wavefront sensor. Methods: A customized system was built to simultaneously capture images of ocular wavefront aberrations and anterior ocular biometry. Relaxed, 2-diopter (D) and 4-D accommodative states were repeatedly measured i...

Gaseous Reservoir Horizons Determination via Vp/Vs and Q-Factor Data, Kangan-Dalan Formations, in One of SW Iranian hydrocarbon Fields

An important method in oil and gas exploration is the vertical seismic profile (VSP), which is used to estimate rock properties in a drilled well. The quality factor is also a key measure of seismic attenuation in VSP data. In the present study, this factor was used to evaluate the hydrocarbon potential of the Kangan Formation in one of the Persian Gulf oil fields using the zero-offset VSP method. Quality factor was estimat...

Optimization as Estimation with Gaussian Processes in Bandit Settings

Recently, there has been rising interest in Bayesian optimization – the optimization of an unknown function with assumptions usually expressed by a Gaussian Process (GP) prior. We study an optimization strategy that directly uses an estimate of the argmax of the function. This strategy offers both practical and theoretical advantages: no tradeoff parameter needs to be selected, and, moreover, w...

Collaborative Spatial Reuse in Wireless Networks via Selfish Multi-Armed Bandits

Next-generation wireless deployments are characterized by being dense and uncoordinated, which often leads to inefficient use of resources and poor performance. To solve this, we envision the utilization of completely decentralized mechanisms that enhance Spatial Reuse (SR). In particular, we concentrate on Reinforcement Learning (RL), and more specifically on Multi-Armed Bandits (MABs), to al...

Journal title:
  • CoRR

Volume: abs/1706.01502

Publication date: 2017